{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Processing HTML Files\n",
    "\n",
    "We will be using **html2parquet transform**\n",
    "\n",
    "References\n",
    "- [html2parquet](https://github.com/data-prep-kit/data-prep-kit/tree/dev/transforms/language/html2parquet/python)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step-1: Data\n",
    "\n",
    "We will process data that is downloaded using [1_crawl_site.ipynb](1_crawl_site.ipynb).\n",
    "\n",
    "We have a couple of crawled HTML files in  `input` directory. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step-2: Configuration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "## setup path to utils folder\n",
    "import sys\n",
    "sys.path.append('../utils')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "## All config is defined here\n",
    "from my_config import MY_CONFIG"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "✅ Cleared  output directory\n"
     ]
    }
   ],
   "source": [
    "import os, sys\n",
    "import shutil\n",
    "\n",
    "shutil.rmtree(MY_CONFIG.OUTPUT_DIR, ignore_errors=True)\n",
    "shutil.os.makedirs(MY_CONFIG.OUTPUT_DIR, exist_ok=True)\n",
    "shutil.os.makedirs(MY_CONFIG.OUTPUT_DIR_HTML, exist_ok=True)\n",
    "shutil.os.makedirs(MY_CONFIG.OUTPUT_DIR_MARKDOWN, exist_ok=True)\n",
    "\n",
    "print (\"✅ Cleared  output directory\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step-3: HTML2Parquet\n",
    "\n",
    "Process HTML documents and extract the text in markdown format"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "01:19:34 INFO - html2parquet parameters are : {'output_format': <html2parquet_output_format.MARKDOWN: 'markdown'>, 'favor_precision': <html2parquet_favor_precision.TRUE: 'True'>, 'favor_recall': <html2parquet_favor_recall.TRUE: 'True'>}\n",
      "01:19:34 INFO - pipeline id pipeline_id\n",
      "01:19:34 INFO - code location None\n",
      "01:19:34 INFO - data factory data_ is using local data access: input_folder - input output_folder - output/1-html2parquet\n",
      "01:19:34 INFO - data factory data_ max_files -1, n_sample -1\n",
      "01:19:34 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.html'], files to checkpoint ['.parquet']\n",
      "01:19:34 INFO - orchestrator html2parquet started at 2025-03-28 01:19:34\n",
      "01:19:34 INFO - Number of files is 20, source profile {'max_file_size': 0.28227996826171875, 'min_file_size': 0.1003103256225586, 'total_file_size': 2.61575984954834}\n",
      "01:19:34 INFO - Completed 1 files (5.0%) in 0.005 min\n",
      "01:19:34 INFO - Completed 2 files (10.0%) in 0.005 min\n",
      "01:19:34 INFO - Completed 3 files (15.0%) in 0.006 min\n",
      "01:19:34 INFO - Completed 4 files (20.0%) in 0.006 min\n",
      "01:19:34 INFO - Completed 5 files (25.0%) in 0.006 min\n",
      "01:19:34 INFO - Completed 6 files (30.0%) in 0.006 min\n",
      "01:19:34 INFO - Completed 7 files (35.0%) in 0.007 min\n",
      "01:19:34 INFO - Completed 8 files (40.0%) in 0.007 min\n",
      "01:19:34 INFO - Completed 9 files (45.0%) in 0.007 min\n",
      "01:19:34 INFO - Completed 10 files (50.0%) in 0.008 min\n",
      "01:19:34 INFO - Completed 11 files (55.0%) in 0.008 min\n",
      "01:19:34 INFO - Completed 12 files (60.0%) in 0.008 min\n",
      "01:19:35 INFO - Completed 13 files (65.0%) in 0.012 min\n",
      "01:19:35 INFO - Completed 14 files (70.0%) in 0.013 min\n",
      "01:19:35 INFO - Completed 15 files (75.0%) in 0.013 min\n",
      "01:19:35 INFO - Completed 16 files (80.0%) in 0.013 min\n",
      "01:19:35 INFO - Completed 17 files (85.0%) in 0.013 min\n",
      "01:19:35 INFO - Completed 18 files (90.0%) in 0.014 min\n",
      "01:19:35 INFO - Completed 19 files (95.0%) in 0.014 min\n",
      "01:19:35 INFO - Completed 20 files (100.0%) in 0.015 min\n",
      "01:19:35 INFO - Done processing 20 files, waiting for flush() completion.\n",
      "01:19:35 INFO - done flushing in 0.0 sec\n",
      "01:19:35 INFO - Completed execution in 0.015 min, execution result 0\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "✅ Operation completed successfully\n"
     ]
    }
   ],
   "source": [
    "from dpk_html2parquet.transform_python import Html2Parquet\n",
    "\n",
    "result = Html2Parquet(input_folder= MY_CONFIG.INPUT_DIR, \n",
    "               output_folder= MY_CONFIG.OUTPUT_DIR_HTML, \n",
    "               data_files_to_use=['.html'],\n",
    "               html2parquet_output_format= \"markdown\"\n",
    "               ).transform()\n",
    "\n",
    "if result == 0:\n",
    "    print (f\"✅ Operation completed successfully\")\n",
    "else:\n",
    "    raise Exception (f\"❌ Operation  failed\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step-4: Inspect the Output\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Successfully read 20 parquet files with 20 total rows\n",
      "Output dimensions (rows x columns)=  (20, 6)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>document</th>\n",
       "      <th>contents</th>\n",
       "      <th>document_id</th>\n",
       "      <th>size</th>\n",
       "      <th>date_acquired</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>thealliance_ai_focus-areas-foundation-models-d...</td>\n",
       "      <td>thealliance_ai_focus-areas-foundation-models-d...</td>\n",
       "      <td># Open Foundation Models and Datasets\\n\\n### E...</td>\n",
       "      <td>a3b82b0f98d1458175b52e597115f3b329b5d7ba79e224...</td>\n",
       "      <td>5076</td>\n",
       "      <td>2025-03-28T01:19:35.158600</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>thealliance_ai_focus-areas-skills-education_te...</td>\n",
       "      <td>thealliance_ai_focus-areas-skills-education_te...</td>\n",
       "      <td># Skills &amp; Education\\n\\n### Supporting global ...</td>\n",
       "      <td>d98ef830df5e293bb7903e021b60194e8b4e529ef4824b...</td>\n",
       "      <td>334</td>\n",
       "      <td>2025-03-28T01:19:35.190791</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>thealliance_ai_core-projects-trust-and-safety-...</td>\n",
       "      <td>thealliance_ai_core-projects-trust-and-safety-...</td>\n",
       "      <td>Much like other software, generative AI (“GenA...</td>\n",
       "      <td>49b5d414088c84ed87d57b0f58ad01756f86a108eff666...</td>\n",
       "      <td>770</td>\n",
       "      <td>2025-03-28T01:19:34.872435</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>thealliance_ai_governance_text.html</td>\n",
       "      <td>thealliance_ai_governance_text.html</td>\n",
       "      <td># AI Alliance Program Governance\\n\\n### The fo...</td>\n",
       "      <td>2ebae522be4ed86e2809e9f49a011c4c7ce2eaf1eeb9e0...</td>\n",
       "      <td>14848</td>\n",
       "      <td>2025-03-28T01:19:35.235162</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>thealliance_ai_core-projects-the-living-guide-...</td>\n",
       "      <td>thealliance_ai_core-projects-the-living-guide-...</td>\n",
       "      <td># The Living Guide to Applying AI\\n\\nProjectAp...</td>\n",
       "      <td>2baeeebebd08f790702e5f9eddd659f601e8187f27aad1...</td>\n",
       "      <td>859</td>\n",
       "      <td>2025-03-28T01:19:34.851579</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                               title  \\\n",
       "0  thealliance_ai_focus-areas-foundation-models-d...   \n",
       "1  thealliance_ai_focus-areas-skills-education_te...   \n",
       "2  thealliance_ai_core-projects-trust-and-safety-...   \n",
       "3                thealliance_ai_governance_text.html   \n",
       "4  thealliance_ai_core-projects-the-living-guide-...   \n",
       "\n",
       "                                            document  \\\n",
       "0  thealliance_ai_focus-areas-foundation-models-d...   \n",
       "1  thealliance_ai_focus-areas-skills-education_te...   \n",
       "2  thealliance_ai_core-projects-trust-and-safety-...   \n",
       "3                thealliance_ai_governance_text.html   \n",
       "4  thealliance_ai_core-projects-the-living-guide-...   \n",
       "\n",
       "                                            contents  \\\n",
       "0  # Open Foundation Models and Datasets\\n\\n### E...   \n",
       "1  # Skills & Education\\n\\n### Supporting global ...   \n",
       "2  Much like other software, generative AI (“GenA...   \n",
       "3  # AI Alliance Program Governance\\n\\n### The fo...   \n",
       "4  # The Living Guide to Applying AI\\n\\nProjectAp...   \n",
       "\n",
       "                                         document_id   size  \\\n",
       "0  a3b82b0f98d1458175b52e597115f3b329b5d7ba79e224...   5076   \n",
       "1  d98ef830df5e293bb7903e021b60194e8b4e529ef4824b...    334   \n",
       "2  49b5d414088c84ed87d57b0f58ad01756f86a108eff666...    770   \n",
       "3  2ebae522be4ed86e2809e9f49a011c4c7ce2eaf1eeb9e0...  14848   \n",
       "4  2baeeebebd08f790702e5f9eddd659f601e8187f27aad1...    859   \n",
       "\n",
       "                date_acquired  \n",
       "0  2025-03-28T01:19:35.158600  \n",
       "1  2025-03-28T01:19:35.190791  \n",
       "2  2025-03-28T01:19:34.872435  \n",
       "3  2025-03-28T01:19:35.235162  \n",
       "4  2025-03-28T01:19:34.851579  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from file_utils import read_parquet_files_as_df\n",
    "\n",
    "output_df = read_parquet_files_as_df(MY_CONFIG.OUTPUT_DIR_HTML)\n",
    "\n",
    "print (\"Output dimensions (rows x columns)= \", output_df.shape)\n",
    "\n",
    "output_df.head(5)\n",
    "\n",
    "## To display certain columns\n",
    "#parquet_df[['column1', 'column2', 'column3']].head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'thealliance_ai_focus-areas-foundation-models-datasets_text.html'"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "output_df.iloc[0,]['title']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'thealliance_ai_focus-areas-foundation-models-datasets_text.html'"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "output_df.iloc[0,]['document']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "content length: 5076 \n",
      "\n",
      "# Open Foundation Models and Datasets\n",
      "\n",
      "### Enabling an ecosystem of open foundation models, including those with multilingual and multi-modal capabilities, and open datasets.\n",
      "\n",
      "We are responsibly enhancing the ecosystem of open foundation models and datasets. We are embracing multilingual and multimodal models, as well as science models tackling broad societal issues like climate change and education.\n",
      "\n",
      "To aid AI model builders and application developers, we’re collaborating to develop and promote open-source tools for model training, tuning, and inference. We are also launching programs to foster the open development of AI in safe and beneficial ways, and hosting events to explore AI use cases.\n",
      "\n",
      "Without good datasets, model training and tuning would be impossible. We are promoting the development of open datasets with clear governance and provenance controls so they can be used without concerns for legal and other risks.\n",
      "\n",
      "## Current or recent projects\n",
      "\n",
      "![](https://images.prismic.io/ai-alliance/Z1IMWJbqstJ98Fvj_OpentrusteddatainitiativeWebsite.png?auto=format%2Ccompress&fit=max&w=3840)\n",
      "\n",
      "\n",
      "![](https://images.prismic.io/ai-alliance/Z1IMWJbqstJ98Fvj_OpentrusteddatainitiativeWebsite.png?auto=format%2Ccompress&fit=max&w=3840)\n",
      "\n",
      "### Open Trusted Data Initiative\n",
      "\n",
      "Foundation Models and Datasets\n",
      "\n",
      "Cataloging and managing trustworthy datasets.\n",
      "\n",
      "![abstract gradient](https://images.prismic.io/ai-alliance/ZjK4AkMTzAJOCen3_Def5c47d4c357524eecedabb6c6549c94096x2731.jpeg?auto=format%2Ccompress&fit=max&w=3840)\n",
      "\n",
      "\n",
      "![abstract gradient](https://images.prismic.io/ai-alliance/ZjK4AkMTzAJOCen3_Def5c47d4c357524eecedabb6c6549c94096x2731.jpeg?auto=format%2Ccompress&fit=max&w=3840)\n",
      "\n",
      "### Time Series Data and Model Initiative\n",
      "\n",
      "Foundation Models and Datasets\n",
      "\n",
      "Time-series applications are an important target for AI. In addition to gathering high-quality and fully-governed time series datasets as part of the Open Trusted Data Initiative, Alliance members are collaborating on new and improved time series models (as part of the Industry Open FMs Initiative and benchmarks, both general-purpose and application-specific.\n",
      "\n",
      "Please join us. We need time series and domain experts, including especially subject matter experts and use case and product owners who would like to apply emerging time series foundation models to new applications. There is an acute shortage of good, open datasets for time series and data specially benchmarks and evaluation methods for various use cases. Contributions are especially welcome here.\n",
      "\n",
      "![abstract gradient](https://images.prismic.io/ai-alliance/ZjK4bkMTzAJOCen6_61158110d0017fde4179c419f197ecb7.png?auto=format%2Ccompress&fit=max&w=3840)\n",
      "\n",
      "\n",
      "![abstract gradient](https://images.prismic.io/ai-alliance/ZjK4bkMTzAJOCen6_61158110d0017fde4179c419f197ecb7.png?auto=format%2Ccompress&fit=max&w=3840)\n",
      "\n",
      "### Industry Open FMs Initiative\n",
      "\n",
      "Foundation Models and Datasets\n",
      "\n",
      "We have seen rapid progress in building and releasing highly-capable and open foundation models for general language, coding, scientific discovery, and multi-modal scenarios.\n",
      "\n",
      "A key development in model strategies is a focus on building smaller, more specialized models.\n",
      "\n",
      "More details are coming soon, but we would love for you to join us. We need both model-building and domain experts, including those outside the target domains listed above.\n",
      "\n",
      "## Working groups\n",
      "\n",
      "\n",
      "### AI for Drug Discovery\n",
      "\n",
      "This working group aims to create a world-class research community that harnesses the potential of AI foundation models, transforms the field of drug discovery, and accelerates scientific progress by driving interdisciplinary collaboration on AI-powered drug discovery projects in the open.\n",
      "\n",
      "\n",
      "### Climate and Sustainability Working Group\n",
      "\n",
      "Our working group uses AI to solve urgent climate challenges. Through open collaboration, we are developing AI models and complimentary tools to drive sustainable solutions for Earth monitoring, environmental preservation, and climate action.\n",
      "\n",
      "![Working Group for Materials and Chemistry](https://images.prismic.io/ai-alliance/Z2W2BJbqstJ98vrv_blog_fm4m2_3_D_fc60c29f04.jpeg?auto=format%2Ccompress&fit=max&w=3840)\n",
      "\n",
      "\n",
      "### Materials and Chemistry Working Group\n",
      "\n",
      "![Working Group for Materials and Chemistry](https://images.prismic.io/ai-alliance/Z2W2BJbqstJ98vrv_blog_fm4m2_3_D_fc60c29f04.jpeg?auto=format%2Ccompress&fit=max&w=3840)\n",
      "\n",
      "We aim to curate datasets, tasks and benchmarks for materials science, build out foundation models in chemistry for prediction of properties, experimental outcomes or generation of new candidates and create a framework to foster collaboration between human experts and AI agents that will ultimately help solve global urgent challenges in sustainability and safety of materials.\n",
      "\n",
      "![abstract gradient](https://images.prismic.io/ai-alliance/Zi0Ufd3JpQ5PTOSw_2007835b3a02f48a82d5cd6785ee4be1.png?auto=format%2Ccompress&fit=max&w=3840)\n",
      "\n",
      "\n",
      "### Open Foundation Models and Data\n",
      "\n",
      "![abstract gradient](https://images.prismic.io/ai-alliance/Zi0Ufd3JpQ5PTOSw_2007835b3a02f48a82d5cd6785ee4be1.png?auto=format%2Ccompress&fit=max&w=3840)\n"
     ]
    }
   ],
   "source": [
    "## Display markdown text\n",
    "print ('content length:', len(output_df.iloc[0,]['contents']), '\\n')\n",
    "print (output_df.iloc[0,]['contents'])\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "## display markdown in pretty format\n",
    "# from IPython.display import Markdown\n",
    "# display(Markdown(output_df.iloc[0,]['contents']))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step-5: Save the markdown"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "✅ Saved 20 md files into 'output/2-markdown'\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "\n",
    "for index, row in output_df.iterrows():\n",
    "    html_file = row['document']\n",
    "    base_name = os.path.splitext(os.path.basename(html_file))[0]\n",
    "    md_output_file = os.path.join(MY_CONFIG.OUTPUT_DIR_MARKDOWN, base_name +  '.md')\n",
    "    \n",
    "    with open(md_output_file, 'w') as md_output_file_handle:\n",
    "        md_output_file_handle.write (row['contents'])\n",
    "# -- end loop ---       \n",
    "\n",
    "print (f\"✅ Saved {index+1} md files into '{MY_CONFIG.OUTPUT_DIR_MARKDOWN}'\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "dpk-rag-html-1-r1.1.0",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
